Chapter 9 Constructions and Idioms
library(tidyverse)
library(tidytext)
library(quanteda)
library(stringr)
library(jiebaR)
library(readtext)
9.1 Collostruction
In this chapter, I would like to talk about the relationship between constructions and words. Words may co-occur to form collocation patterns. When words co-occur with a particular morphosyntactic pattern, they form collostruction patterns.
Here I would like to introduce a widely applied method for research on the meanings of constructional schemas—Collostructional Analysis (Stefanowitsch and Gries 2003). This is the major framework in corpus linguistics for the study of the relationship between words and constructions.
The idea behind collostructional analysis is simple: the meaning of a morphosyntactic construction can very often be determined by its co-occurring words.
In particular, words that are strongly associated (i.e., co-occurring) with the construction are referred to as collexemes of the construction.
Collostructional Analysis is an umbrella term, which covers several sub-analyses for constructional semantics:
- collexeme analysis
- co-varying collexeme analysis
- distinctive collexeme analysis
This chapter will focus on the first one, collexeme analysis, whose principles can be extended to the other analyses.
Also, I will demonstrate how we can conduct a collexeme analysis by using the R script written by Stefan Gries (Collostructional Analysis).
9.2 Corpus
In this chapter, I will use the Apple News Corpus from Chapter 8 as our corpus. (It is available in: demo_data/applenews10000.tar.gz.)
In this demonstration, I would like to look at a particular morphosyntactic frame in Chinese, X + 起來. Our goal is simple: to find out the semantics of this constructional schema, it would be very informative to identify which words tend to strongly occupy the X slot of the schema.
So our first step is to load the text collections of Apple News into R and create a corpus object.
9.3 Word Segmentation
Because Apple News Corpus is a raw-text corpus, we first need to word-tokenize the corpus.
First, we convert the corpus object into a tidy text-based tibble.
Second, because later we need to extract constructions from texts, we add a new column to our text-based corpus, which includes the segmented version of the texts utilizing the jiebaR segmenter.
# Initialize the segmenter
segmenter <- worker(user = "demo_data/dict-ch-user.txt",
                    bylines = FALSE,
                    symbol = TRUE)
# Define our own tokenization function
word_seg_text <- function(text, jiebar) {
  segment(text, jiebar) %>% # vector output
    str_c(collapse = " ")
}
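As a quick sanity check, we can try the helper on a short string. This is a minimal sketch: the segmenter below is initialized with jiebaR's bundled default dictionary only (no user dictionary), and the example sentence is invented.

```r
library(jiebaR)
library(stringr)

# a demo segmenter with the default dictionary only (no user dictionary)
seg_demo <- worker(bylines = FALSE, symbol = TRUE)

word_seg_text <- function(text, jiebar) {
  # segment() returns a character vector of tokens;
  # str_c() re-joins them with spaces as explicit word boundaries
  str_c(segment(text, jiebar), collapse = " ")
}

word_seg_text("他走起來很快", seg_demo)
```

The function returns a single space-delimited string per input text, which is why `map_chr()` is used in the next step to fill the `text_tag` column.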
# From `corpus` to `tibble`
apple_df <- apple_corpus %>%
  tidy %>%
  filter(text != "") %>% # remove empty documents
  mutate(doc_id = row_number()) # create document index
# Tokenization
apple_df <- apple_df %>%
  mutate(text_tag = map_chr(text, word_seg_text, segmenter))
9.4 Extract Constructions
With the word boundary information, we can now extract our target patterns from the corpus using regular expressions with unnest_tokens().
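Before extracting from the full corpus, it may help to sanity-check the regular expression on a toy segmented string (the sentence here is invented for illustration):

```r
library(stringr)

pattern_qilai <- "[^\\s]+\\s起來\\b"
toy <- "這 件 事 說 起來 容易 做 起來 難"

# each match is one word, a space, and 起來
str_extract_all(toy, pattern_qilai)
# [[1]] "說 起來" "做 起來"
```

Because the texts have been re-joined with spaces as word boundaries, `[^\s]+` is guaranteed to match exactly one preceding word.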
# Define regex
pattern_qilai <- "[^\\s]+\\s起來\\b"
# Extract patterns
apple_df %>%
  select(-text) %>%
  unnest_tokens(output = construction,
                input = text_tag,
                token = function(x) str_extract_all(x, pattern = pattern_qilai)) -> apple_qilai
# Print
apple_qilai
9.5 Distributional Information Needed for CA
To perform the collostructional analysis, which is essentially a statistical analysis of the association between the words and the constructions, we need to collect necessary distributional information of the words and constructions.
In particular, to use Stefan Gries’ R script of Collostructional Analysis, we need the following information:
- Joint Frequencies of Words and Constructions
- Frequencies of Words in Corpus
- Corpus Size (total number of words in corpus)
- Construction Size (total number of constructions in corpus)
9.5.1 Word Frequency List
It is easy to get the word frequencies.
With the tokenized texts, we first convert the text-based tibble into a word-based one; then we create the word frequency list via simple data manipulation tricks.
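The same steps can be illustrated on a toy two-document tibble (with invented texts): `unnest_tokens()` with a custom token function splits the space-delimited strings back into words.

```r
library(dplyr)
library(tidytext)
library(stringr)

toy_df <- tibble(doc_id = 1:2,
                 text_tag = c("我 喜歡 語言學", "語言學 很 有趣"))

toy_counts <- toy_df %>%
  unnest_tokens(word, text_tag,
                token = function(x) str_split(x, "\\s+|\u3000")) %>%
  filter(nzchar(word)) %>%       # drop empty strings from splitting
  count(word, sort = TRUE)

toy_counts
# 語言學 appears twice; every other word once
```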
# word frequency
apple_df %>%
  select(-text) %>%
  unnest_tokens(word,
                text_tag,
                token = function(x) str_split(x, "\\s+|\u3000")) %>%
  filter(nzchar(word)) %>%
  count(word, sort = TRUE) -> apple_word
apple_word
9.5.2 Joint Frequencies
With all the extracted construction tokens, apple_qilai, it is also easy to get the joint frequencies of words and constructions as well as the construction frequencies.
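The `separate()` + `match()` steps can be sketched on toy data (the construction tokens and word counts below are invented) before running them on the real corpus:

```r
library(dplyr)
library(tidyr)

toy_qilai <- tibble(construction = c("說 起來", "做 起來", "說 起來"))
toy_word  <- tibble(word = c("說", "做", "起來"), n = c(10, 5, 8))

qilai_demo <- toy_qilai %>%
  count(construction, sort = TRUE) %>%            # joint frequencies
  separate(col = "construction",
           into = c("w1", "construction"),
           sep = "\\s") %>%                       # split off the X-slot word
  mutate(w1_freq = toy_word$n[match(w1, toy_word$word)])  # look up word freq

qilai_demo
# 說: joint freq 2, word freq 10; 做: joint freq 1, word freq 5
```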
# Joint frequency table
apple_qilai %>%
  count(construction, sort = TRUE) %>%
  tidyr::separate(col = "construction",
                  into = c("w1", "construction"),
                  sep = "\\s") %>%
  mutate(w1_freq = apple_word$n[match(w1, apple_word$word)]) -> apple_qilai_table
apple_qilai_table
9.5.3 Input for coll.analysis.r
Specifically, Stefan Gries’ coll.analysis.r expects a particular input format.
The input file should be a tsv file, which includes a three-column table:
- Words
- Word frequency in the corpus
- Word joint frequency with the construction
Note that Stefan Gries’ R script expects the input as a tab-delimited file (tsv), not a comma-delimited file (csv).
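A sketch of writing such an input file with `readr::write_tsv()`; the column names and numbers below are placeholders for illustration, not names required by the script:

```r
library(readr)
library(tibble)

# hypothetical three-column input table (invented numbers)
qilai_input <- tibble(word       = c("說", "做"),
                      word_freq  = c(10, 5),
                      joint_freq = c(2, 1))

# write_tsv() produces the tab-delimited format the script expects
write_tsv(qilai_input, "qilai.tsv")
```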
9.5.4 Other Information
In addition to the input file, Stefan Gries’ coll.analysis.r also requires a few statistics for the computing of association measures.
We prepare necessary distributional information for the later collostructional analysis:
- Corpus size
- Construction size
## Corpus Size: 3209617
## Construction Size: 546
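Both quantities can be derived from the objects built earlier; here is a minimal sketch using toy stand-ins for apple_word and apple_qilai (with invented counts):

```r
# toy stand-ins for the real apple_word and apple_qilai objects
toy_word  <- data.frame(word = c("說", "做", "起來"), n = c(10, 5, 8))
toy_qilai <- data.frame(construction = c("說 起來", "做 起來"))

corpus_size       <- sum(toy_word$n)   # total word tokens in the corpus
construction_size <- nrow(toy_qilai)   # total construction tokens

corpus_size       # 23
construction_size # 2
```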
9.5.5 Create Output File
Stefan Gries’ coll.analysis.r can automatically output the results into an external file.
Before running the CA script, we can first create an empty output txt file to keep the results from the CA script.
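In base R this is a one-liner; the file name here follows the qilai_results.txt example used later:

```r
# create an empty text file to receive the results of the CA script
file.create("qilai_results.txt")
file.exists("qilai_results.txt")  # TRUE
```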
9.5.6 Run coll.analysis.r
Finally we are now ready to perform the collostructional analysis using Stefan Gries’ coll.analysis.r.
We can use source() to run an entire R script. The coll.analysis.r script is available on Stefan Gries’ website. You can save the script onto your laptop and run it offline, or source the online version directly.
Stefan Gries’ coll.analysis.r initializes the analysis by first removing all the objects in your current R session. Please make sure that you have saved all necessary information/objects in your current R session before you source the script.
####################################
# WARNING!!!!!!!!!!!!!!! #
# The script re-starts a R session #
####################################
source("http://www.stgries.info/teaching/groningen/coll.analysis.r")
coll.analysis.r is an R script with interactive instructions.
When you run the analysis, you will be prompted with guide questions, which you need to answer in the R console.
For our current example, the answers to be entered for each prompt include:
- analysis to perform: 1
- name of construction: QILAI
- corpus size: 3209617
- freq of constructions: 546
- index of association strength: 1 (= fisher-exact)
- sorting: 4 (= collostruction strength)
- decimals: 2
- text file with the raw data: <qilai.tsv>
- where to save output: 1 (= text file)
- output file: <qilai_results.txt>
If everything works properly, you should get the output of coll.analysis.r as a text file qilai_results.txt in your working directory.
The text output may look as follows.

9.5.7 Interpretations
The output from coll.analysis.r is a text file with both the result data frame (i.e., the data frame with all the statistics) as well as detailed annotations/explanations provided by Stefan Gries.
We can also extract the result data frame from the text file. The output file from the collexeme analysis of QILAI has been made available in demo_data/qilai_results.txt.
To extract the result data frame from the output text file:
- We first load the result txt file like a normal text file using readLines().
- We then extract the lines which include the statistics and parse them as a TSV into a data frame using read_tsv().
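The trick of parsing an in-memory character vector as a TSV can be illustrated on toy lines (the rows are invented). Note that recent versions of readr require wrapping literal data in I():

```r
library(readr)

toy_lines <- c("words\tobs.freq\tcoll.strength",
               "說\t2\t5.43",
               "做\t1\t1.20")

collo_demo <- read_tsv(I(toy_lines))
collo_demo
# a 2-row, 3-column tibble
```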
# load the output txt
results <- readLines("demo_data/qilai_results.txt")
# subset lines (drop the header and footer annotations)
results <- results[-c(1:17, (length(results) - 17):length(results))]
# parse the tab-delimited lines into a data frame
collo_table <- read_tsv(results)
# auto-print
collo_table %>%
  filter(relation == "attraction") %>%
  arrange(desc(coll.strength)) %>%
  head(100) %>%
  select(words, coll.strength, everything())
With the collexeme analysis statistics, we can explore the top N collexemes according to specific association metrics.
Here we look at the top 10 collexemes according to four different distributional metrics:
- obs.freq: the raw joint frequency of the word and the construction
- delta.p.constr.to.word: the delta P of the construction to the word
- delta.p.word.to.constr: the delta P of the word to the construction
- coll.strength: the log-transformed p-value based on the Fisher exact test
# from wide to long
collo_table %>%
  filter(relation == "attraction") %>%
  filter(obs.freq >= 5) %>%
  select(words, obs.freq,
         delta.p.constr.to.word,
         delta.p.word.to.constr,
         coll.strength) %>%
  pivot_longer(cols = c("obs.freq",
                        "delta.p.constr.to.word",
                        "delta.p.word.to.constr",
                        "coll.strength"),
               names_to = "metric",
               values_to = "strength") %>%
  mutate(metric = factor(metric,
                         levels = c("obs.freq",
                                    "delta.p.constr.to.word",
                                    "delta.p.word.to.constr",
                                    "coll.strength"))) %>%
  group_by(metric) %>%
  top_n(10, strength) %>%
  ungroup %>%
  arrange(metric, desc(strength)) -> coll_table_long
# plot
graphs <- list()
for (i in levels(coll_table_long$metric)) {
  coll_table_long %>%
    filter(metric %in% i) %>%
    ggplot(aes(reorder(words, strength), strength, fill = strength)) +
    geom_col(show.legend = FALSE) +
    coord_flip() +
    labs(x = "Collexemes",
         y = "Strength",
         title = i) +
    theme(text = element_text(family = "Arial Unicode MS")) -> graphs[[i]]
}
require(ggpubr)
ggpubr::ggarrange(plotlist = graphs)
The bar plots above show the top 10 collexemes based on four different metrics: obs.freq, delta.p.constr.to.word, delta.p.word.to.constr, and coll.strength.
Many studies have shown that Chinese makes use of a large proportion of four-character idioms in discourse. Four-character idioms are therefore a recurrent topic in the study of Chinese constructions.
In our demo_data directory, there is a file dict-ch-idiom.txt, which includes a list of four-character idioms in Chinese. These idioms were collected from 搜狗輸入法詞庫 (the Sogou input method lexicon); the original files (.scel) have been combined, deduplicated, and converted to a more machine-readable format, i.e., .txt.
You can load the dataset in R for more exploration of idioms.
all_idioms <- readLines(con = "demo_data/dict-ch-idiom.txt")
head(all_idioms)
tail(all_idioms)
length(all_idioms)
9.6 Exercises
The following exercises should use the dataset Yet Another Chinese News Dataset from Kaggle.
The dataset is available on our Dropbox: demo_data/corpus-news-collection.csv.
The dataset is a collection of news articles in Traditional and Simplified Chinese, including some Internet news outlets that are NOT Chinese state media.
Exercise 9.1 Please conduct a collexeme analysis for the aspectual construction “X + 了” in Chinese.
Extract all tokens of this construction from the news corpus and identify all words preceding the aspectual marker.
Based on the distributional information, conduct the collexeme analysis using coll.analysis.r and present the collexemes that significantly co-occur with the construction “X + 了” in the X slot. Rank the collexemes according to the collostruction strength provided by Stefan Gries’ script.
When you tokenize the texts using jiebaR, you may run into an error message as shown below. If you do, please figure out what leads to the issue and solve the problem on your own.

- It is suggested that you parse/tokenize the corpus data and create two columns in the text-based tibble: text_id and text_tag. The following is an example of the first ten articles.
After my data preprocessing and tokenization, here is the relevant distributional information for coll.analysis.r:
- Corpus Size: 8131049
- Construction Size: 25618
The output of the Collexeme Analysis (coll.analysis.r):
- When plotting the results, if you have Inf values in the coll.strength column, please replace all the Inf values with the maximum numeric value of the coll.strength column.
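Replacing Inf with the largest finite value can be done with base R subsetting, sketched here on a toy vector:

```r
x <- c(3.2, Inf, 1.5, Inf)

# overwrite infinite entries with the maximum finite value
x[is.infinite(x)] <- max(x[is.finite(x)])
x
# 3.2 3.2 1.5 3.2
```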

Exercise 9.2 Using the same Chinese news corpus—demo_data/corpus-news-collection.csv, please create a frequency list of all four-character words/idioms that are included in the four-character idiom dictionary demo_data/dict-ch-idiom.txt.
Please include both the frequency as well as the dispersion of each four-character idiom in the corpus. Dispersion is defined as the number of articles where it is observed.
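Frequency and dispersion defined this way can be computed together in one summarise() call; here is a sketch on toy data with invented article IDs:

```r
library(dplyr)

toy <- tibble(doc_id = c(1, 1, 2, 3),
              idiom  = c("一心一意", "一心一意", "一心一意", "滿坑滿谷"))

disp_demo <- toy %>%
  group_by(idiom) %>%
  summarise(freq       = n(),                  # token frequency
            dispersion = n_distinct(doc_id)) %>%  # number of articles
  arrange(desc(dispersion))

disp_demo
# 一心一意: freq 3, dispersion 2; 滿坑滿谷: freq 1, dispersion 1
```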
Please arrange the four-character idioms according to their dispersion.
## user system elapsed
## 16.589 0.294 16.913
Exercise 9.3 Let’s assume that we are particularly interested in idioms of the schema X_X_, such as “一心一意”, “民脂民膏”, “滿坑滿谷” (i.e., idioms where the first and third characters are identical).
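One way to match this schema is a regular expression with a backreference, requiring the third character to repeat the first; a sketch on the idioms mentioned above plus one non-matching idiom:

```r
idioms <- c("一心一意", "民脂民膏", "滿坑滿谷", "四面楚歌")

# ^(.).\1.$ : capture the 1st character and require the 3rd to repeat it
idioms[grepl("^(.).\\1.$", idioms, perl = TRUE)]
# "一心一意" "民脂民膏" "滿坑滿谷"
```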

Exercise 9.4 Continuing the previous exercise, the idioms of the schema X_X_ may have different types of X. Here we refer to the character X as the pivot of the idiom.

- For example, the type frequency of the most productive pivot schema, “不_不_”, is 21 in the news corpus. That is, there are 21 types of constructional variants of this schema, as shown below:
Exercise 9.5 Continuing the previous exercise, to further study the semantic uniqueness of each pivot schema, please identify the top 5 idioms of each pivot schema according to the frequencies of the idioms in the corpus.
Please present the results for schemas whose type frequencies >= 5 (i.e., the pivot schema has at least FIVE different idioms as its constructional instances).
Please visualize your results as shown below.
Exercise 9.6 Let’s assume that we are interested in how different media may use the four-character words differently.
Please show the average number of idioms per article by different media and visualize the results in bar plots as shown below.
The average number of idioms per article can be computed based on token frequency (i.e., on average how many idiom tokens were observed per article?) or type frequency (i.e., on average how many different idiom types were observed per article?).
- For example, there are 2529 tokens (1443 types) of idioms observed in the 1756 articles published by “Zaobao”. The average token frequency of idiom uses would be 2529/1756 = 1.440205; the average type frequency of idiom uses would be 1443/1756 = 0.821754.
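The arithmetic from the Zaobao example can be verified directly:

```r
tokens   <- 2529  # idiom tokens observed
types    <- 1443  # distinct idiom types observed
articles <- 1756  # articles published by Zaobao

round(tokens / articles, 6)  # 1.440205  (average token frequency)
round(types / articles, 6)   # 0.821754  (average type frequency)
```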

References
Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2): 209–43.